We want to use the RStudio interface to control Git. I created an RStudio Project, wrote some content, and I hope to add some nice statistical graphs :).
This course looks very interesting. But I am really worried now because my laptop broke down on Friday. I hope I can recover my data.
*Update, one week later: my work PC is running again and I recovered my data, but I no longer trust it :(
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Wed Dec 09 11:38:06 2020"
REPOSITORY:
“There is a pleasure in the pathless woods,
There is a rapture on the lonely shore,
There is society, where none intrudes,
By the deep Sea, and music in its roar:
I love not Man the less, but Nature more,
From these our interviews, in which I steal
From all I may be, or have been before,
To mingle with the Universe, and feel
What I can ne’er express, yet cannot all conceal.”
by Lord Byron
───▄▀▀▀▄▄▄▄▄▄▄▀▀▀▄───
───█▒▒░░░░░░░░░▒▒█───
────█░░█░░░░░█░░█────
─▄▄──█░░░▀█▀░░░█──▄▄─
█░░█─▀▄░░░░░░░▄▀─█░░█
First session: creating a data set.
Data source: survey of Approaches to Learning.
Kimmo Vehkalahti: ASSIST 2014 - Phase 3 (end of Part 2), N=183. Course: Johdatus yhteiskuntatilastotieteeseen, syksy 2014 (Introduction to Social Statistics, fall 2014 - in Finnish), international survey of Approaches to Learning, made possible by Teachers’ Academy funding for KV in 2013-2015.
Dataset structure: 166 observations of 7 variables (plus a row index).
Several questions were asked in the survey; three learning approaches (deep, surface and strategic) were derived by combining related questions.
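Combining related Likert items into an approach score is simply a row mean. A minimal sketch, with hypothetical item names (`d1`, `d2`, `d3` stand in for the actual survey questions):

```r
# toy answers to three hypothetical deep-approach items (1-5 Likert scale)
answers <- data.frame(d1 = c(4, 2), d2 = c(5, 3), d3 = c(3, 1))

# the deep score is the mean of the related items, row by row
answers$deep <- rowMeans(answers[, c("d1", "d2", "d3")])
answers$deep
# [1] 4 2
```

The surface and strategic scores are built the same way from their own sets of questions.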
data <- read.csv("~/GitHub/IODS-project/IODS-project/data_AKMV/learning2014.csv")
# alternatively, relative to the project root:
# data <- read.csv("learning2014.csv")
str(data)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
Gender: 110 female (F) and 56 male (M)
Age: 17 to 55 years (mean 25)
Attitude: mean 3.2 (range 1.4-5)
Deep: mean 3.68 (range 1.58-4.92)
Strategic (stra): mean 3.12 (range 1.25-5)
Surface (surf): mean 2.79 (range 1.58-4.33)
Points: mean 22.72 (range 7-33)
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
ggpairs(data, mapping = aes(colour=gender), legend = 1,
title="Fig.1 Variables overview",
lower = list(combo = wrap("facethist",size=0.1, bins = 20, alpha=0.3)))+
theme(legend.position="bottom")
The participants’ age, the deep learning approach and the points obtained in the final test are not normally distributed (Fig.1).
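Normality could also be checked formally, for example with a Shapiro-Wilk test; a sketch assuming `data` is loaded as above (a small p-value indicates departure from normality):

```r
# test each of the variables flagged as non-normal in Fig.1
shapiro.test(data$Age)$p.value
shapiro.test(data$deep)$p.value
shapiro.test(data$Points)$p.value
```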
LINEAR REGRESSION MODEL
mod1<-lm(Points~Age+attitude+deep+surf+stra+gender, data=data)
summary(mod1)
##
## Call:
## lm(formula = Points ~ Age + attitude + deep + surf + stra + gender,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4506 -3.2479 0.3879 3.5243 11.4820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.30246 5.25723 3.481 0.000644 ***
## Age -0.09607 0.05396 -1.780 0.076929 .
## attitude 3.44300 0.59776 5.760 4.24e-08 ***
## deep -1.04279 0.78438 -1.329 0.185606
## surf -1.10891 0.84467 -1.313 0.191132
## stra 0.95871 0.55150 1.738 0.084083 .
## genderM -0.05858 0.92985 -0.063 0.949845
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.265 on 159 degrees of freedom
## Multiple R-squared: 0.2312, Adjusted R-squared: 0.2021
## F-statistic: 7.967 on 6 and 159 DF, p-value: 1.599e-07
Attitude, age and strategy are the most relevant variables, so the model is refitted with only these three.
mod2<-lm(Points~Age+attitude+stra, data=data)
summary(mod2)
##
## Call:
## lm(formula = Points ~ Age + attitude + stra, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## Age -0.08822 0.05302 -1.664 0.0981 .
## attitude 3.48077 0.56220 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
The equation obtained is:
Points = 10.90 + 3.48*attitude + 1.00*stra - 0.09*Age (residual standard error 5.26)
The model explains about 20% of the variance in exam points (adjusted R-squared = 0.20). Attitude is the most important variable for obtaining more points. The value of this model lies in explaining and detecting the most influential variables in the learning process rather than in predicting the points.
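A minimal sketch of using the fitted model for prediction (assuming `mod2` from above; the attitude, strategy and age values below are made up for illustration):

```r
# hypothetical student: positive attitude, average strategic approach, 25 years old
new_student <- data.frame(attitude = 4.0, stra = 3.0, Age = 25)
predict(mod2, newdata = new_student, interval = "prediction")
# point estimate about 25.6 points, with a wide ~95% prediction interval
```

The wide interval reflects the residual standard error of 5.26: the model is better at explaining than predicting.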
#Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
par(mfrow=c(2,2))
plot(mod2)
The diagnostic plots are not fancy but they are important: the residuals show no clear pattern against the fitted values and follow the normal Q-Q line reasonably well, so the model assumptions appear to hold.
These data concern student achievement in secondary education at two Portuguese schools. The attributes include student grades, demographic, social and school-related features, and they were collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). More information: https://archive.ics.uci.edu/ml/datasets/Student+Performance
The joined data set used in the analysis exercise combines the two student alcohol consumption data sets. The following adjustments have been made:
The variables not used for joining the two datasets have been combined by averaging (including the grade variables); ‘alc_use’ is the average of ‘Dalc’ (workday consumption) and ‘Walc’ (weekend consumption); ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise.
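The derivation of the two alcohol variables can be sketched with dplyr (toy rows standing in for the joined data):

```r
library(dplyr)

# Dalc/Walc are 1-5 Likert scores for workday and weekend consumption
students <- data.frame(Dalc = c(1, 2, 4), Walc = c(1, 4, 5))

students <- students %>%
  mutate(alc_use  = (Dalc + Walc) / 2,   # average of workday and weekend use
         high_use = alc_use > 2)         # TRUE above the chosen cutoff of 2

students
#   Dalc Walc alc_use high_use
# 1    1    1     1.0    FALSE
# 2    2    4     3.0     TRUE
# 3    4    5     4.5     TRUE
```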
library(readxl)
pormath <- read_excel("data_AKMV/pormath.xlsx")
str(pormath)
## tibble [370 x 51] (S3: tbl_df/tbl/data.frame)
## $ school : chr [1:370] "GP" "GP" "GP" "GP" ...
## $ sex : chr [1:370] "F" "F" "F" "F" ...
## $ age : num [1:370] 15 15 15 15 15 15 15 15 15 15 ...
## $ address : chr [1:370] "R" "R" "R" "R" ...
## $ famsize : chr [1:370] "GT3" "GT3" "GT3" "GT3" ...
## $ Pstatus : chr [1:370] "T" "T" "T" "T" ...
## $ Medu : num [1:370] 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : num [1:370] 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : chr [1:370] "at_home" "other" "at_home" "services" ...
## $ Fjob : chr [1:370] "other" "other" "other" "health" ...
## $ reason : chr [1:370] "home" "reputation" "reputation" "course" ...
## $ guardian : chr [1:370] "mother" "mother" "mother" "mother" ...
## $ traveltime: num [1:370] 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : num [1:370] 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ famsup : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ activities: chr [1:370] "yes" "no" "yes" "yes" ...
## $ nursery : chr [1:370] "yes" "no" "yes" "yes" ...
## $ higher : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ internet : chr [1:370] "yes" "yes" "no" "yes" ...
## $ romantic : chr [1:370] "no" "yes" "no" "no" ...
## $ famrel : num [1:370] 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : num [1:370] 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : num [1:370] 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : num [1:370] 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : num [1:370] 1 4 1 1 3 1 2 3 3 1 ...
## $ health : num [1:370] 1 5 2 5 3 5 5 4 3 4 ...
## $ n : num [1:370] 2 2 2 2 2 2 2 2 2 2 ...
## $ id.p : num [1:370] 1096 1073 1040 1025 1166 ...
## $ id.m : num [1:370] 2096 2073 2040 2025 2153 ...
## $ failures : num [1:370] 0 1 0 0 1 0 1 0 0 0 ...
## $ paid : chr [1:370] "yes" "no" "no" "no" ...
## $ absences : num [1:370] 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : num [1:370] 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : num [1:370] 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : num [1:370] 12 8 12 9 12 12 6 10 16 10 ...
## $ failures.p: num [1:370] 0 0 0 0 0 0 0 0 0 0 ...
## $ paid.p : chr [1:370] "yes" "no" "no" "no" ...
## $ absences.p: num [1:370] 4 2 8 2 2 2 0 0 6 10 ...
## $ G1.p : num [1:370] 13 13 14 10 13 11 10 11 15 10 ...
## $ G2.p : num [1:370] 13 11 13 11 13 12 11 10 15 10 ...
## $ G3.p : num [1:370] 13 11 12 10 13 12 12 11 15 10 ...
## $ failures.m: num [1:370] 1 2 0 0 2 0 2 0 0 0 ...
## $ paid.m : chr [1:370] "yes" "no" "yes" "yes" ...
## $ absences.m: num [1:370] 2 2 8 2 8 2 0 2 12 10 ...
## $ G1.m : num [1:370] 7 8 14 10 10 12 12 8 16 10 ...
## $ G2.m : num [1:370] 10 6 13 9 10 12 0 9 16 11 ...
## $ G3.m : num [1:370] 10 5 13 8 10 11 0 8 16 11 ...
## $ alc_use : num [1:370] 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi [1:370] FALSE TRUE FALSE FALSE TRUE FALSE ...
## $ cid : num [1:370] 3001 3002 3003 3004 3005 ...
#classmate's script:
library(tidyr); library(dplyr); library(ggplot2); library(gridExtra); library(readr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
alc <- read_csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_character(),
## age = col_double(),
## Medu = col_double(),
## Fedu = col_double(),
## traveltime = col_double(),
## studytime = col_double(),
## failures = col_double(),
## famrel = col_double(),
## freetime = col_double(),
## goout = col_double(),
## Dalc = col_double(),
## Walc = col_double(),
## health = col_double(),
## absences = col_double(),
## G1 = col_double(),
## G2 = col_double(),
## G3 = col_double(),
## alc_use = col_double(),
## high_use = col_logical()
## )
## i Use `spec()` for the full column specifications.
data <- mutate(alc, ID = row_number()) %>%
select(any_of(c("ID", "high_use", "absences", "health", "freetime", "famrel")))
p1 <- ggplot(data, aes(absences)) + stat_count(geom="line", aes(colour=high_use) )
p2 <- ggplot(data, aes(health)) + geom_bar(aes(fill=high_use), position="dodge")
p3 <- ggplot(data, aes(freetime)) + geom_bar(aes(fill=high_use), position="dodge")
p4 <- ggplot(data, aes(famrel)) + geom_bar(aes(fill=high_use), position="dodge")
grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)
date()
## [1] "Wed Dec 09 11:38:36 2020"
The data set this week is about: Housing values in suburbs of Boston (source:https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html) there are several variables that affect the housing values such as: criminality, infrastructure, density, size of the house…
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.84 loaded
library(ggplot2)
library(GGally)
library (plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
data("Boston")
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
cor<-cor(Boston, method = "pearson", use = "complete.obs")
round(cor, 2)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
corrplot.mixed(cor, lower = "ellipse", upper="number", tl.col = "black", tl.srt = 45)
Bigger houses are more expensive (number of rooms, r = 0.70) and house values are lower in areas with a higher proportion of lower-status population (lstat, r = -0.74). There are intercorrelations between the variables that can help explain the house values.
Here we standardize the data: we subtract the column means from the corresponding columns and divide the differences by the standard deviations, so all variables end up with mean 0 and standard deviation 1.
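The same standardization can be written out by hand to confirm what `scale()` does:

```r
data(Boston, package = "MASS")

# subtract each column mean, then divide by each column standard deviation
manual <- sweep(as.matrix(Boston), 2, colMeans(Boston), "-")
manual <- sweep(manual, 2, apply(Boston, 2, sd), "/")

all.equal(manual, scale(Boston), check.attributes = FALSE)
# TRUE: the manual computation matches scale()
```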
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)
Creating a quantile vector of crime:
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
Dividing the dataset into train and test sets, so that 80% of the data belongs to the train set.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
Fitting the linear discriminant analysis (LDA) and LDA (bi)plot.
LDA is a classification (and dimension reduction) method. It finds the (linear) combination of the variables that separate the target variable classes. The target can be binary or multiclass variable.
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 3, arrow_heads = 0.1, color = "orange", tex = 1.25, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)}# the function for lda biplot arrows (datacamp)
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 3)
The plot shows that the high-crime class is separated mainly by accessibility to radial highways (rad), consistent with the correlation matrix (r = 0.63 between crim and rad).
correct_classes<-test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test) #Test=the rest=20%
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 12 1 0
## med_low 6 19 1 0
## med_high 1 17 9 1
## high 0 0 1 19
The cross-tabulation shows that the extreme classes (low and high crime rate) are predicted best, while the middle classes are often confused with each other.
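The overall accuracy can be computed from the same objects (a sketch; the exact figure varies with the random train/test split):

```r
# proportion of test-set districts whose crime class was predicted correctly
accuracy <- mean(correct_classes == lda.pred$class)
accuracy
# about 0.6 for the split shown above (62 correct out of 102)
```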
Based on the elbow in the total within-cluster sum of squares (below), the optimal number of clusters is 2.
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
dis<-dist(boston_scaled)
summary(dis)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
set.seed(123)
k_max <- 20
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss}) # total within sum of squares for each k (computed on the scaled data)
qplot(x = 1:k_max, y = twcss, geom = 'line')
Visualization of the clusters per variable:
km <-kmeans(boston_scaled, centers = 2)
ggpairs(boston_scaled, mapping = aes(colour=as.factor(km$cluster)), legend = 1,
upper = list(continuous =wrap("cor", size=3)),
title="clusters overview",
lower = list(combo = wrap("facethist",size=0.1, bins = 20, alpha=0.3)))+
theme(legend.position="bottom")
## Warning in cor(x, y): the standard deviation is zero
km2 <-kmeans(boston_scaled, centers= 3)
boston_scaled$km_cluster<-km2$cluster
lda.fit2 <- lda(km_cluster ~ ., data = boston_scaled)# linear discriminant analysis
lda.fit2
## Call:
## lda(km_cluster ~ ., data = boston_scaled)
##
## Prior probabilities of groups:
## 1 2 3
## 0.3241107 0.4664032 0.2094862
##
## Group means:
## crim zn indus chas nox rm
## 1 0.8046456 -0.4872402 1.117990 0.01575144 1.1253988 -0.46443119
## 2 -0.3760908 -0.3417123 -0.296848 0.01127561 -0.3345884 -0.09228038
## 3 -0.4075892 1.5146367 -1.068814 -0.04947434 -0.9962503 0.92400834
## age dis rad tax ptratio black
## 1 0.79737580 -0.85425848 1.2219249 1.2954050 0.60580719 -0.6407268
## 2 -0.02966623 0.05695857 -0.5803944 -0.6030198 -0.08691245 0.2863040
## 3 -1.16762641 1.19486951 -0.5983266 -0.6616391 -0.74378342 0.3538816
## lstat medv
## 1 0.8719904 -0.68418954
## 2 -0.1801190 0.03577844
## 3 -0.9480974 0.97889973
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim 0.03134296 0.14880455
## zn 0.06381527 1.22350515
## indus -0.61086696 0.10402980
## chas -0.01953161 -0.03579238
## nox -1.00230143 0.70464917
## rm 0.16285767 0.44390394
## age 0.07220634 -0.59785382
## dis 0.04270475 0.45498614
## rad -0.71987743 0.02882054
## tax -0.98285440 0.70663319
## ptratio -0.22527977 0.15514668
## black 0.01693595 -0.03181845
## lstat -0.18274033 0.50122677
## medv 0.02892966 0.64244841
##
## Proportion of trace:
## LD1 LD2
## 0.8409 0.1591
LDA biplot:
classes <- as.numeric(boston_scaled$km_cluster)# target classes as numeric
plot(lda.fit2, dimen = 2, col = classes, pch = classes)# plot the lda results
lda.arrows(lda.fit2, myscale = 3)
There are differences between the clusters: cluster 3 is best explained by the rad and tax variables, while cluster 1 is best explained by age.
Projecting the training data onto the LDA discriminants, coloured by crime incidence:
model_predictors <- dplyr::select(train, -crime)
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling # project the predictors onto the discriminants
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
color=train$crime, type= 'scatter3d', mode='markers')
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Creation of 2 clusters:
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
train <- boston_scaled[ind,]
str(train)
## 'data.frame': 404 obs. of 14 variables:
## $ crim : num -0.3923 -0.4183 0.0275 -0.4087 0.4549 ...
## $ zn : num -0.487 3.372 -0.487 -0.487 -0.487 ...
## $ indus : num -0.211 -1.077 1.015 2.116 1.015 ...
## $ chas : num -0.272 -0.272 3.665 -0.272 -0.272 ...
## $ nox : num 0.262 -1.387 1.858 0.227 1.366 ...
## $ rm : num -0.378 1.664 0.157 -0.577 0.188 ...
## $ age : num -0.116 -1.221 0.797 0.967 1.056 ...
## $ dis : num -0.658 1.207 -0.613 -0.849 -0.765 ...
## $ rad : num -0.408 -0.752 1.66 -0.867 1.66 ...
## $ tax : num -0.102 -0.974 1.529 -1.307 1.529 ...
## $ ptratio: num 0.344 -1.18 0.806 0.298 0.806 ...
## $ black : num 0.441 0.325 0.38 0.249 -0.575 ...
## $ lstat : num 0.0374 -1.3364 0.0864 0.6899 0.9322 ...
## $ medv : num -0.1449 2.3341 -0.0906 -0.4059 -1.0365 ...
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
km3 <-kmeans(train, centers= 2)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
color=as.factor(km3$cluster), type= 'scatter3d', mode='markers')
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Creation of 3 clusters:
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
train <- boston_scaled[ind,]
km4 <-kmeans(train, centers= 3)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
color=as.factor(km4$cluster), type= 'scatter3d', mode='markers')
library(readr)
library(dplyr)
library(corrplot)
library(GGally)
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(tidyr)
library(MASS)
date()
## [1] "Wed Dec 09 11:39:53 2020"
Description of the ‘human’ dataset variables.
Original data from: http://hdr.undp.org/en/content/human-development-index-hdi
Retrieved, modified and analyzed by Tuomo Nieminen, 2017.
The data combine several indicators from most countries in the world; “Country” is the country name.
There are in total 8 variables explaining human development considered here:
Related to health and knowledge:
Edu.Exp: expected years of schooling, Life.Exp: life expectancy at birth, GNI: gross national income per capita, Mat.Mor: maternal mortality ratio, Ado.Birth: adolescent birth rate.
Related to empowerment:
Edu2.FM: ratio of females to males with at least secondary education, Labo.FM: ratio of female to male labour force participation, Parli.F: percentage of female representatives in parliament.
human <- read_csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt")
##
## -- Column specification --------------------------------------------------------
## cols(
## Edu2.FM = col_character(),
## Labo.FM = col_double(),
## Edu.Exp = col_double(),
## Life.Exp = col_double(),
## GNI = col_double(),
## Mat.Mor = col_double(),
## Ado.Birth = col_double(),
## Parli.F = col_double()
## )
## Warning: 155 parsing failures.
## row col expected actual file
## 1 -- 8 columns 9 columns 'http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt'
## 2 -- 8 columns 9 columns 'http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt'
## 3 -- 8 columns 9 columns 'http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt'
## 4 -- 8 columns 9 columns 'http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt'
## 5 -- 8 columns 9 columns 'http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt'
## ... ... ......... ......... ........................................................................................
## See problems(...) for more details.
str(human)
## tibble [155 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Edu2.FM : chr [1:155] "Norway" "Australia" "Switzerland" "Denmark" ...
## $ Labo.FM : num [1:155] 1.007 0.997 0.983 0.989 0.969 ...
## $ Edu.Exp : num [1:155] 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num [1:155] 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : num [1:155] 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Mat.Mor : num [1:155] 64992 42261 56431 44025 45435 ...
## $ Ado.Birth: num [1:155] 4 6 6 5 6 7 9 28 11 8 ...
## $ Parli.F : num [1:155] 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## - attr(*, "problems")= tibble [155 x 5] (S3: tbl_df/tbl/data.frame)
## ..$ row : int [1:155] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ col : chr [1:155] NA NA NA NA ...
## ..$ expected: chr [1:155] "8 columns" "8 columns" "8 columns" "8 columns" ...
## ..$ actual : chr [1:155] "9 columns" "9 columns" "9 columns" "9 columns" ...
## ..$ file : chr [1:155] "'http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt'" "'http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt'" "'http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt'" "'http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt'" ...
## - attr(*, "spec")=
## .. cols(
## .. Edu2.FM = col_character(),
## .. Labo.FM = col_double(),
## .. Edu.Exp = col_double(),
## .. Life.Exp = col_double(),
## .. GNI = col_double(),
## .. Mat.Mor = col_double(),
## .. Ado.Birth = col_double(),
## .. Parli.F = col_double()
## .. )
Correlation between mentioned variables:
human <- read.csv("~/GitHub/IODS-project/IODS-project/data_AKMV/human2.txt")
ggpairs(human)
PCA raw data:
human <- read.csv("~/GitHub/IODS-project/IODS-project/data_AKMV/human2.txt")
pca_human<-prcomp(human)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Contribution of the variables to the dimensions:
res.pca1 <- prcomp(human, center=TRUE, scale = TRUE)
fviz_pca_var(res.pca1, col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),repel = TRUE, title="Not-standardized variables")
PC1 explains 53.6% of the total variance, while PC2 explains only 16.2%. The variables that contribute most to PC1 are life expectancy, maternal mortality and expected years of schooling; PC2 is driven mainly by female labour force participation.
PCA of standardized variables:
human_std <- scale(human)
pca_human <- prcomp(human_std)
biplot(pca_human,repel=TRUE, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in plot.window(...): "repel" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "repel" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "repel" is not a
## graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "repel" is not a
## graphical parameter
## Warning in box(...): "repel" is not a graphical parameter
## Warning in title(...): "repel" is not a graphical parameter
## Warning in text.default(x, xlabs, cex = cex[1L], col = col[1L], ...): "repel" is
## not a graphical parameter
## Warning in plot.window(...): "repel" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "repel" is not a graphical parameter
## Warning in title(...): "repel" is not a graphical parameter
## Warning in axis(3, col = col[2L], ...): "repel" is not a graphical parameter
## Warning in axis(4, col = col[2L], ...): "repel" is not a graphical parameter
## Warning in text.default(y, labels = ylabs, cex = cex[2L], col = col[2L], :
## "repel" is not a graphical parameter
Contribution of the variables to the dimensions:
res.pca <- prcomp(human_std, center=TRUE, scale = TRUE)
fviz_pca_var(res.pca, col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),repel = TRUE, title="Standardized variables")
The relevance of the variables does not change, but the distribution of the countries does.
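The variance explained by each component can be read directly from the PCA summary; a sketch assuming `pca_human` was fitted on the standardized data as above:

```r
s <- summary(pca_human)
# proportion of variance per principal component, as percentages
round(100 * s$importance["Proportion of Variance", ], 1)
```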
The data used here come from a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions), about their perception of the product (12 questions) and for some personal details (4 questions).
(https://rdrr.io/cran/FactoMineR/man/tea.html)
Overview of the data:
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time3 <- dplyr::select(tea, one_of(keep_columns))
gather(tea_time3) %>% ggplot(aes(value)) +
facet_wrap("key", scales = "free") + geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Making the MCA graphs:
mca <- MCA(tea_time3, graph = FALSE)
fviz_mca_var(mca, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE,
ggtheme = theme_minimal())
Having a look at the variables' contributions to the first dimension:
fviz_contrib(mca, choice = "var", axes = 1, top = 15)
library(readr)
library(dplyr)
library(corrplot)
library(GGally)
library(FactoMineR)
library(factoextra)
library(ggplot2)
library(tidyr)
library(MASS)
date()
## [1] "Wed Dec 09 11:40:21 2020"
Description of the datasets' variables:
BPRS <- read_table2("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt")
##
## -- Column specification --------------------------------------------------------
## cols(
## treatment = col_double(),
## subject = col_double(),
## week0 = col_double(),
## week1 = col_double(),
## week2 = col_double(),
## week3 = col_double(),
## week4 = col_double(),
## week5 = col_double(),
## week6 = col_double(),
## week7 = col_double(),
## week8 = col_double()
## )
str(BPRS)
## tibble [40 x 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ treatment: num [1:40] 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : num [1:40] 1 2 3 4 5 6 7 8 9 10 ...
## $ week0 : num [1:40] 42 58 54 55 72 48 71 30 41 57 ...
## $ week1 : num [1:40] 36 68 55 77 75 43 61 36 43 51 ...
## $ week2 : num [1:40] 36 61 41 49 72 41 47 38 39 51 ...
## $ week3 : num [1:40] 43 55 38 54 65 38 30 38 35 55 ...
## $ week4 : num [1:40] 41 43 43 56 50 36 27 31 28 53 ...
## $ week5 : num [1:40] 40 34 28 50 39 29 40 26 22 43 ...
## $ week6 : num [1:40] 38 28 29 47 32 33 30 26 20 43 ...
## $ week7 : num [1:40] 47 28 25 42 38 27 31 25 23 39 ...
## $ week8 : num [1:40] 51 28 24 46 32 25 31 24 21 32 ...
## - attr(*, "spec")=
## .. cols(
## .. treatment = col_double(),
## .. subject = col_double(),
## .. week0 = col_double(),
## .. week1 = col_double(),
## .. week2 = col_double(),
## .. week3 = col_double(),
## .. week4 = col_double(),
## .. week5 = col_double(),
## .. week6 = col_double(),
## .. week7 = col_double(),
## .. week8 = col_double()
## .. )
rats <- read_table2("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt")
##
## -- Column specification --------------------------------------------------------
## cols(
## `"ID"` = col_character(),
## `"Group"` = col_double(),
## `"WD1"` = col_double(),
## `"WD8"` = col_double(),
## `"WD15"` = col_double(),
## `"WD22"` = col_double(),
## `"WD29"` = col_double(),
## `"WD36"` = col_double(),
## `"WD43"` = col_double(),
## `"WD44"` = col_double(),
## `"WD50"` = col_double(),
## `"WD57"` = col_double(),
## `"WD64"` = col_double()
## )
## Warning: 16 parsing failures.
## row col expected actual file
## 1 -- 13 columns 14 columns 'https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt'
## 2 -- 13 columns 14 columns 'https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt'
## 3 -- 13 columns 14 columns 'https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt'
## 4 -- 13 columns 14 columns 'https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt'
## 5 -- 13 columns 14 columns 'https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt'
## ... ... .......... .......... ......................................................................................
## See problems(...) for more details.
str(rats)
## tibble [16 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ "ID" : chr [1:16] "\"1\"" "\"2\"" "\"3\"" "\"4\"" ...
## $ "Group": num [1:16] 1 2 3 4 5 6 7 8 9 10 ...
## $ "WD1" : num [1:16] 1 1 1 1 1 1 1 1 2 2 ...
## $ "WD8" : num [1:16] 240 225 245 260 255 260 275 245 410 405 ...
## $ "WD15" : num [1:16] 250 230 250 255 260 265 275 255 415 420 ...
## $ "WD22" : num [1:16] 255 230 250 255 255 270 260 260 425 430 ...
## $ "WD29" : num [1:16] 260 232 255 265 270 275 270 268 428 440 ...
## $ "WD36" : num [1:16] 262 240 262 265 270 275 273 270 438 448 ...
## $ "WD43" : num [1:16] 258 240 265 268 273 277 274 265 443 460 ...
## $ "WD44" : num [1:16] 266 243 267 270 274 278 276 265 442 458 ...
## $ "WD50" : num [1:16] 266 244 267 272 273 278 271 267 446 464 ...
## $ "WD57" : num [1:16] 265 238 264 274 276 284 282 273 456 475 ...
## $ "WD64" : num [1:16] 272 247 268 273 278 279 281 274 468 484 ...
## - attr(*, "problems")= tibble [16 x 5] (S3: tbl_df/tbl/data.frame)
## ..$ row : int [1:16] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ col : chr [1:16] NA NA NA NA ...
## ..$ expected: chr [1:16] "13 columns" "13 columns" "13 columns" "13 columns" ...
## ..$ actual : chr [1:16] "14 columns" "14 columns" "14 columns" "14 columns" ...
## ..$ file : chr [1:16] "'https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt'" "'https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt'" "'https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt'" "'https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt'" ...
## - attr(*, "spec")=
## .. cols(
## .. `"ID"` = col_character(),
## .. `"Group"` = col_double(),
## .. `"WD1"` = col_double(),
## .. `"WD8"` = col_double(),
## .. `"WD15"` = col_double(),
## .. `"WD22"` = col_double(),
## .. `"WD29"` = col_double(),
## .. `"WD36"` = col_double(),
## .. `"WD43"` = col_double(),
## .. `"WD44"` = col_double(),
## .. `"WD50"` = col_double(),
## .. `"WD57"` = col_double(),
## .. `"WD64"` = col_double()
## .. )
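The quoted column names (`"ID"`, `"WD1"`, …) and the 16 parsing failures suggest that rats.txt is tab-separated, so the whitespace-based `read_table2` misparses it. A hedged alternative with base `read.table` (assuming tab separation):

```r
# The embedded quotes in the parsed names hint that the file is tab-separated;
# read.table with sep = '\t' strips the quotes and parses all 13 columns.
rats <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt",
                   header = TRUE, sep = '\t')
str(rats)  # column names without embedded quotes
```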
# transform BPRS to long form: treatment and subject are categories, not numbers
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
# gather the weekN columns into long format, then extract the week number
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks, 5, 5)))
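The `substr(weeks, 5, 5)` trick works because every label is `week` followed by a single digit (0-8). A self-contained sketch, with a `gsub` variant that would also handle hypothetical multi-digit labels such as `week10`:

```r
weeks <- paste0("week", 0:8)
as.integer(substr(weeks, 5, 5))   # 0 1 2 3 4 5 6 7 8
# robust to week10, week11, ...: strip the prefix instead of taking one character
as.integer(gsub("week", "", weeks))
```

With `substr(. , 5, 5)` a label like `week10` would yield 1, so the `gsub` form is safer if the study ever ran past week 9.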
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "we...
## $ bprs <dbl> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 6...
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times = 4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))